How to Create a Boxplot
A boxplot (also called a box-and-whisker plot) is a graphical representation of the distribution of a dataset. It summarizes the data with the five-number summary and can help identify outliers. In this lesson, we define a boxplot and outline the steps to construct one.
Boxplot
What is a Boxplot?
A boxplot visually displays the spread and skewness of a dataset using a box and two line segments (the whiskers). It is built from the five-number summary: the minimum, the first quartile \(Q_1\), the median \(Q_2\), the third quartile \(Q_3\), and the maximum.
- A vertical line segment is drawn at each of \(Q_1\), \(Q_2\), and \(Q_3\). The ends of the segments at \(Q_1\) and \(Q_3\) are connected to create a box.
- A line is drawn from the center of the side defined by \(Q_1\) to the minimum value.
- A line is drawn from the center of the side defined by \(Q_3\) to the maximum value.
Some boxplots also display outliers, but the version we are constructing does not do that.
How to Draw a Boxplot
Steps to Construct a Boxplot
Follow these steps to create a boxplot:
- Find the five-number summary.
- Draw the box: Plot a box from \( Q_1 \) to \( Q_3 \), with a vertical line at the median.
- Draw the whiskers: Extend lines from the box to the minimum and maximum values.
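The first step can be sketched in Python. This is a minimal sketch, not the method of any particular calculator: the function name `five_number_summary` is hypothetical, and it assumes the convention in which \(Q_1\) and \(Q_3\) are the medians of the lower and upper halves of the sorted data, with the overall median left out of both halves when the count is odd. That convention reproduces the quartiles quoted in Example 1 below, but the lesson does not state which convention its tools use, and other conventions can give slightly different quartiles.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two numbers.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves;
    the overall median is excluded from both halves when the count is odd.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]               # values below the middle position
    upper = data[len(data) - half:]   # values above the middle position
    return (data[0], median(lower), median(data), median(upper), data[-1])


print(five_number_summary([7, 1, 5, 3, 9, 11, 13]))  # (1, 3, 7, 11, 13)
```

The box then runs from the second value to the fourth, and the whiskers reach out to the first and last values.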
Example 1
This dataset represents the length of gaming sessions (in minutes) for 25 gamers who played an online multiplayer game over the weekend. Using the dataset below, construct a boxplot of gaming session lengths.
Session Lengths (Minutes) | ||||
---|---|---|---|---|
15 | 30 | 45 | 60 | 75 |
120 | 150 | 90 | 200 | 180 |
95 | 110 | 130 | 140 | 85 |
70 | 160 | 170 | 55 | 40 |
190 | 210 | 35 | 100 | 250 |
Solution
First, let's copy the data into the Summary Statistics Calculator and select the boxes for the Minimum Value, \(Q_1\), Median, \(Q_3\), and Maximum Value.
Therefore, we get \[\min = 15\quad Q_1=57.5\quad \text{Median}=100\quad Q_3=165\quad\max=250\]
Next, we draw a number line and label the five-number summary to scale on the axis. Then we draw a vertical line segment above each of \(Q_1\), the median, and \(Q_3\).
Next, connect the ends of the segments at \(Q_1\) and \(Q_3\) to draw in the box.
Draw the left whisker from the center of the box's left edge to the minimum value.
Draw the right whisker from the center of the box's right edge to the maximum value.
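For readers who want to check the numbers without the calculator, the sketch below recomputes the summary with Python's standard library and prints a rough one-line text version of the plot. It is an illustration under stated assumptions: `text_boxplot` and its `width` parameter are hypothetical names, and the default `"exclusive"` method of `statistics.quantiles` happens to reproduce the quartiles above for this dataset, which does not show that the calculator uses the same method.

```python
from statistics import quantiles

sessions = [
    15, 30, 45, 60, 75,
    120, 150, 90, 200, 180,
    95, 110, 130, 140, 85,
    70, 160, 170, 55, 40,
    190, 210, 35, 100, 250,
]

# quantiles(..., n=4) returns the three quartile cut points.
q1, median, q3 = quantiles(sessions, n=4)
low, high = min(sessions), max(sessions)


def text_boxplot(low, q1, median, q3, high, width=51):
    """Render a one-line boxplot; column 0 is `low`, the last column is `high`."""
    def col(value):
        # Map a data value to a character column, to scale.
        return round((value - low) / (high - low) * (width - 1))

    row = [" "] * width
    for i in range(col(low), col(q1)):            # left whisker
        row[i] = "-"
    for i in range(col(q3) + 1, col(high) + 1):   # right whisker
        row[i] = "-"
    for i in range(col(q1), col(q3) + 1):         # box between Q1 and Q3
        row[i] = "="
    row[col(q1)] = "["
    row[col(q3)] = "]"
    row[col(median)] = "|"                        # median line inside the box
    return "".join(row)


print(low, q1, median, q3, high)  # 15 57.5 100.0 165.0 250
print(text_boxplot(low, q1, median, q3, high))
```

In the printed row, `[` and `]` mark \(Q_1\) and \(Q_3\), `|` marks the median, and the dashes are the whiskers.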
$$\tag*{\(\blacksquare\)}$$
Example 2
During exam week, researchers recorded the number of cups of coffee consumed daily by 30 college students. Use the Boxplot Generator to construct a boxplot for the coffee consumption.
Number of Cups | |||||||||
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 5 | 3 | 4 | 6 | 7 | 2 |
3 | 4 | 5 | 1 | 0 | 8 | 3 | 6 | 4 | 5 |
2 | 7 | 3 | 4 | 6 | 2 | 1 | 5 | 3 | 4 |
Solution
Copy the data, open the Boxplot Generator, paste the data into the spreadsheet, and close the spreadsheet. The boxplot should generate automatically in the tool.
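To sanity-check what the tool draws, the five-number summary can be computed separately. The sketch below uses `statistics.quantiles` from the Python standard library with both of its quartile methods. Which convention the Boxplot Generator uses is not stated in this lesson, so this is only a cross-check; for this dataset the two methods give the same quartiles.

```python
from statistics import quantiles

cups = [
    0, 1, 2, 3, 5, 3, 4, 6, 7, 2,
    3, 4, 5, 1, 0, 8, 3, 6, 4, 5,
    2, 7, 3, 4, 6, 2, 1, 5, 3, 4,
]

# Two common quartile conventions offered by the standard library.
exclusive = quantiles(cups, n=4, method="exclusive")
inclusive = quantiles(cups, n=4, method="inclusive")

print(min(cups), exclusive, max(cups))  # 0 [2.0, 3.5, 5.0] 8
print(exclusive == inclusive)           # True
```

The box should therefore run from 2 to 5 cups with the median line at 3.5, and the whiskers should reach 0 and 8.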
$$\tag*{\(\blacksquare\)}$$
Skewness in Boxplots
How to Identify Skewness in a Boxplot
A boxplot provides insight into whether a dataset is roughly symmetric, skewed left, or skewed right by examining the position of the median and the length of the whiskers.
- A boxplot suggests a roughly symmetric distribution (the shape a normal distribution would have, although symmetry alone does not establish normality) if the median is centered within the box (between \( Q_1 \) and \( Q_3 \)) and the whiskers on both sides are approximately equal in length.
- A boxplot suggests a left-skewed distribution if either the median is closer to \( Q_3 \) than to \( Q_1 \), or the left whisker is longer than the right whisker.
- A boxplot suggests a right-skewed distribution if either the median is closer to \( Q_1 \) than to \( Q_3 \), or the right whisker is longer than the left whisker.
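The rules above can be sketched as a small Python helper. The name `skew_hints` and the `tolerance` cutoff are hypothetical, since the lesson does not say how close "approximately equal" has to be. The box and the whiskers are reported separately because the two cues can disagree.

```python
def skew_hints(low, q1, median, q3, high, tolerance=0.1):
    """Compare the two halves of the box and the two whiskers.

    Returns (box_hint, whisker_hint); each is "left", "right", or "balanced".
    Assumption: lengths that differ by at most `tolerance` times the longer
    one count as approximately equal.
    """
    def compare(left_length, right_length):
        longer = max(left_length, right_length)
        if longer == 0 or abs(left_length - right_length) <= tolerance * longer:
            return "balanced"
        # The longer side is the direction of the skew.
        return "left" if left_length > right_length else "right"

    box_hint = compare(median - q1, q3 - median)    # median position in the box
    whisker_hint = compare(q1 - low, high - q3)     # relative whisker lengths
    return box_hint, whisker_hint


# Five-number summary from Example 1 (gaming session lengths).
print(skew_hints(15, 57.5, 100, 165, 250))  # ('right', 'right')
```

For the gaming sessions, the median sits closer to \(Q_1\) and the right whisker is longer, so both cues point to a right skew.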
Comparing Datasets Using Boxplots
The Advantages of Boxplots for Comparison
Boxplots are an excellent tool for comparing multiple datasets because they provide a visual summary of key statistical measures while maintaining simplicity. Here’s why boxplots are useful for comparing data:
Side-by-Side Comparison of Distributions
Boxplots allow multiple datasets to be displayed together, making it easy to compare their centers, spreads, and skewness. This is particularly useful when comparing groups, such as test scores across different schools or income levels across regions.
Quick Insight into Variability
The interquartile range (IQR), represented by the width of the box, gives an immediate understanding of how spread out the middle 50% of the data is. If one dataset has a wider box than another drawn on the same scale, the middle 50% of its data is more spread out.
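As a worked instance, using the quartiles found in Example 1 (gaming session lengths), the width of the box is
\[\text{IQR} = Q_3 - Q_1 = 165 - 57.5 = 107.5 \text{ minutes,}\]
so the middle 50% of those sessions span 107.5 minutes.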
Easy Detection of Skewness
By examining the median’s position inside the box and the relative whisker lengths, boxplots make it easy to see if a dataset is symmetrical, right-skewed, or left-skewed. This helps in understanding differences between distributions, such as salary distributions in different industries.
Compact Yet Informative
Boxplots do not require a large amount of space and can be used effectively in reports or presentations to compare datasets quickly. They condense information into a single, easy-to-read visual while still capturing essential details about the data.
Example 3
These datasets represent the 100-meter sprint times (in seconds) for two groups: Olympic sprinters and high school sprinters. Use the Boxplot Generator for each dataset and compare their distributions.
High School (s) | | | | | Olympic (s) | | | | |
---|---|---|---|---|---|---|---|---|---|
10.55 | 10.60 | 10.65 | 10.72 | 10.78 | 9.58 | 9.69 | 9.72 | 9.76 | 9.81 |
10.82 | 10.85 | 10.89 | 10.94 | 10.98 | 9.85 | 9.88 | 9.91 | 9.93 | 9.95 |
11.02 | 11.07 | 11.10 | 11.14 | 11.18 | 9.98 | 10.01 | 10.03 | 10.05 | 10.08 |
11.21 | 11.25 | 11.29 | 11.35 | 11.40 | 10.12 | 10.15 | 10.19 | 10.22 | 10.25 |
Solution
Copy the data, open the Boxplot Generator, paste the data into the spreadsheet, and close the spreadsheet. The plot for the high school sprint times should appear automatically.
To generate the second boxplot, change the number of datasets to Two, and set the second data set to column B. Then, the second plot will appear.
As expected, the high school times are slower than the Olympic times, but also notice that
- the Olympic times are slightly skewed left while the high school times are more symmetric, which means a few Olympians significantly outperform the other athletes in their group,
- the IQR for the high school sprinters is wider than for the Olympians, which means the high school sprinters have more variability in their times.
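These observations can be checked numerically. The sketch below is an illustration only: `box_measurements` is a hypothetical helper, and it assumes the quartiles are the medians of the lower and upper halves of the sorted data, which may differ slightly from the convention the Boxplot Generator uses.

```python
from statistics import median

high_school = [
    10.55, 10.60, 10.65, 10.72, 10.78, 10.82, 10.85, 10.89, 10.94, 10.98,
    11.02, 11.07, 11.10, 11.14, 11.18, 11.21, 11.25, 11.29, 11.35, 11.40,
]
olympic = [
    9.58, 9.69, 9.72, 9.76, 9.81, 9.85, 9.88, 9.91, 9.93, 9.95,
    9.98, 10.01, 10.03, 10.05, 10.08, 10.12, 10.15, 10.19, 10.22, 10.25,
]


def box_measurements(times):
    """Summarize one group: median, IQR, and the two whisker lengths."""
    data = sorted(times)
    half = len(data) // 2
    q1 = median(data[:half])               # median of the lower half
    q3 = median(data[len(data) - half:])   # median of the upper half
    return {
        "median": median(data),
        "iqr": q3 - q1,
        "left_whisker": q1 - data[0],
        "right_whisker": data[-1] - q3,
    }


for name, times in [("High school", high_school), ("Olympic", olympic)]:
    summary = box_measurements(times)
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

Under this convention the high school IQR is about 0.40 s against about 0.27 s for the Olympic group, and the Olympic left whisker (about 0.25 s) is longer than its right whisker (about 0.15 s), which matches the two observations above.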
$$\tag*{\(\blacksquare\)}$$
Conclusion
Boxplots provide a visual summary of a dataset, highlighting its center, spread, and skewness. They are especially useful for comparing groups, such as high school and Olympic sprinters, that perform at very different levels but are measured on the same scale.